Ijraset Journal For Research in Applied Science and Engineering Technology
Authors: Dr. C. Parthasarathy, Shivanesh P, Vedha Vigneshwar T
DOI Link: https://doi.org/10.22214/ijraset.2025.75870
Approximate membership query (AMQ) structures, represented by the Bloom Filter and its variants, have been widely studied in recent years. Researchers have recently combined machine learning with this type of structure to further reduce space consumption and computation overhead, making remarkable progress. However, alongside the impressive gains in space and other metrics, the security of the trained model tends to be overlooked. Machine learning models are vulnerable to poisoning attacks, and it is natural to infer that learning-based filters inherit the same weakness. Hence, to confirm this inference, we conduct experiments on real-world URL datasets and show that the security issue must be considered when using learning-based filters. We show that, through data poisoning, an attacker can deflect learned Bloom Filters into false identifications, which can lead to significant losses in some cases. To address this issue, we propose a method named Defensive Learned Bloom Filter (DLBF) that diminishes the influence of data poisoning and achieves better performance than existing types of learned Bloom Filters.
When Google Chrome checks whether a URL is safe, relying on Google’s central servers for every query creates latency, dependency, and scalability issues. Embedding a local URL blacklist in Chrome mitigates these issues, but storing millions of URLs directly requires excessive space. Bloom Filters, a probabilistic and space-efficient data structure, solve this problem by enabling fast (O(1)) membership checks with no false negatives, at the cost of occasional false positives. For example, 10⁸ URLs stored with a 1% false-positive rate require only about 114 MB, far less than storing the raw URLs.
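To make the 114 MB figure concrete: for n keys and a target false-positive rate p, the standard Bloom filter sizing formulas give m = -n·ln(p)/(ln 2)² bits and k = (m/n)·ln 2 hash functions. The short Python sketch below is an illustration of that arithmetic, not code from the paper; it reproduces the estimate for 10⁸ URLs at p = 0.01.

```python
import math

def bloom_parameters(n: int, p: float):
    """Optimal Bloom filter size (bits) and hash count for n keys at FP rate p."""
    m = -n * math.log(p) / (math.log(2) ** 2)  # m = -n ln p / (ln 2)^2
    k = (m / n) * math.log(2)                  # k = (m/n) ln 2
    return math.ceil(m), round(k)

bits, k = bloom_parameters(10**8, 0.01)
print(f"{bits / 8 / 2**20:.0f} MiB, k = {k}")  # -> "114 MiB, k = 7"
```

At roughly 9.6 bits per key, this is a small fraction of what 10⁸ raw URL strings would occupy.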
Although standard Bloom Filters are widely used in caching, IP traceback, virus scanning, and safe browsing, they still consume more space than the theoretical optimum. To improve space efficiency, Learned Bloom Filters (LBFs) combine machine learning classifiers with a backup Bloom Filter. By leveraging patterns in the data, LBFs and their variants—SLBF, Ada-BF, and PLBF—significantly reduce memory usage. For malicious URL detection, learned Bloom Filters can use up to 60% less space than standard Bloom Filters while supporting updates and generalizing to previously unseen URLs.
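The query path of this classifier-plus-backup design is easy to state. The sketch below is a minimal illustration, assuming a trained classifier exposing a hypothetical score() method and a backup Bloom filter backup_bf built over exactly the malicious URLs the classifier scores below the threshold; that construction is what preserves the no-false-negative guarantee.

```python
def lbf_contains(url, model, threshold, backup_bf):
    """Learned Bloom Filter lookup: classifier first, backup filter second."""
    if model.score(url) >= threshold:
        return True  # classifier predicts malicious (may be a false positive)
    # The backup filter was built over the malicious URLs the classifier
    # scored below the threshold, so a true malicious URL can never fall
    # through both stages: no false negatives overall.
    return url in backup_bf
```

Variants such as SLBF, Ada-BF, and PLBF refine this scheme with sandwiched filters, multiple score regions, or optimized partitions, but all share the same two-stage skeleton.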
However, integrating machine learning introduces new security vulnerabilities, particularly data poisoning attacks. An attacker can inject crafted data into the training set to manipulate the classifier, causing benign URLs to be falsely labeled as malicious. This leads to excessive server requests, degraded performance, or incorrect blocking of legitimate traffic. Learned Bloom Filters used in other applications such as spam filtering and database acceleration face similar risks.
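As a hedged illustration of this threat model (the paper's attack crafts adversarial examples; the simpler label-flipping stand-in below is our assumption for demonstration), an attacker who can influence the training set might relabel a small fraction of benign URLs as malicious before the classifier is fit:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # stand-in URL classifier

def poison_labels(y, rate, rng):
    """Relabel a `rate` fraction of benign (0) training URLs as malicious (1)."""
    y = y.copy()
    benign = np.flatnonzero(y == 0)
    flipped = rng.choice(benign, size=int(rate * benign.size), replace=False)
    y[flipped] = 1
    return y

# Hypothetical usage, with X, y extracted from one of the URL datasets:
# rng = np.random.default_rng(0)
# clean = LogisticRegression(max_iter=1000).fit(X, y)
# dirty = LogisticRegression(max_iter=1000).fit(X, poison_labels(y, 0.05, rng))
```

Comparing the two classifiers' false-positive rates on held-out benign URLs quantifies how much the contamination inflates the learned stage's errors and, with them, the traffic wrongly blocked or forwarded to the server.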
Experiments on four real-world malicious URL datasets show that LBFs and all major variants are vulnerable to poisoning attacks. To address this, we propose the Defensive Learned Bloom Filter (DLBF), a new framework that reduces the influence of poisoned data and optimizes hyperparameters to resist such attacks. DLBF outperforms both standard and learned Bloom Filters under both normal and adversarial conditions.
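The summary above names hyperparameter optimization as one ingredient of DLBF without spelling it out, so the following is a hypothetical sketch of that idea only, not the authors' algorithm: if a small trusted benign validation set is available, the classifier threshold can be chosen so that the learned stage's false-positive rate stays within a budget even when the training set may have been tampered with.

```python
import numpy as np

def pick_threshold(scores_benign, fp_budget):
    """Lowest classifier threshold whose false-positive rate on a trusted
    benign validation set stays within fp_budget."""
    best = float("inf")  # no score reaches +inf, so the default FP rate is 0
    for t in np.sort(np.unique(scores_benign))[::-1]:  # high -> low
        if np.mean(scores_benign >= t) <= fp_budget:
            best = t     # still within budget; try a lower threshold
        else:
            break        # budget exceeded; keep the last good threshold
    return best
```

Because the threshold is validated against data the attacker cannot touch, training-set contamination can degrade the classifier's usefulness but cannot silently push the deployed filter's false-positive rate past the budget on the validation distribution.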
The remainder of the paper introduces Bloom Filter preliminaries, analyzes poisoning attacks in detail, presents the DLBF framework, and discusses related work.
In this paper, we reveal the vulnerability of learning-based Bloom Filters to data poisoning attacks in the application of malicious URL detection. With a small number of crafted examples, an attacker can corrupt the identifications made by learning-based filters, which can incur additional cost in some cases. Hence, we propose a novel method named DLBF to diminish the adverse effects of poisoning attacks, and experiments demonstrate that our method is effective whether or not the training data is poisoned. We therefore hope that security will be taken into account when designing new types of learning-based filters, and that our framework can be used in practical applications or serve as a starting point for work on the same question. Additionally, we hope that the newly proposed DLBF framework will be considered for the safe-browsing application. In future work, it would be interesting to study the security of other learning-based structures and to find methods that eliminate the underlying issue where it exists.
Copyright © 2025 Dr. C. Parthasarathy, Shivanesh P, Vedha Vigneshwar T. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Paper Id : IJRASET75870
Publish Date : 2025-11-27
ISSN : 2321-9653
Publisher Name : IJRASET
